RetroRanker: leveraging reaction changes to improve retrosynthesis prediction through re-ranking

Retrosynthesis is an important task in organic chemistry. Recently, numerous data-driven approaches have achieved promising results in this task. However, in practice, these data-driven methods might lead to sub-optimal outcomes by making predictions based on the training data distribution, a phenomenon we refer to as frequency bias. For example, in template-based approaches, low-ranked predictions are typically generated by less common templates with confidence scores that might be too low to be comparable, and recorded reactants have been observed among these low-ranked predictions. In this work, we introduce RetroRanker, a ranking model built upon graph neural networks, designed to mitigate the frequency bias in predictions of existing retrosynthesis models through re-ranking. RetroRanker incorporates the potential reaction changes that each set of predicted reactants would undergo to yield the given product, and uses them to lower the rank of chemically unreasonable predictions. The re-ranked results on publicly available retrosynthesis benchmarks demonstrate that RetroRanker improves most state-of-the-art models. Our preliminary studies also indicate that RetroRanker can enhance the performance of multi-step retrosynthesis. Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-023-00727-7.


Features used in RetroRanker
We show the complete list of features and their sizes below.

Training Settings of Augmented Transformer
We re-trained Augmented Transformer on the USPTO-full dataset. For each instance, we add 5 random SMILES as data augmentation for the Transformer model. We train the Augmented Transformer model mainly following the settings of MolecularTransformer; the differences are that we use 8 Transformer layers and a hidden size of 512. The model is trained on 8 Nvidia V100 32G GPUs for about 120,000 steps. Our re-trained model outperforms the original Augmented Transformer [1], mainly because the model size is much larger.

RetroRanker with the AttentiveFP
We use two independent three-layer AttentiveFP networks to encode the reactant and product molecular graphs. The size of hidden features is 512 and the dropout is 0.2. The reaction representation is obtained by concatenating six readout channels on molecular graphs or masked molecular graphs, namely, the reactant molecular graph, the product molecular graph, the reactant molecular graph with reacted atoms as masks, the reactant molecular graph with atoms in the leaving groups as masks, the product molecular graph with reacted atoms as masks, and the bond information in reactant molecular graphs. The masked graphs also reveal potential reaction changes: node representations corresponding to masked atoms are preserved, while all other node representations are set to 0. We use a neural network with two linear layers, taking the reaction representation as input, to obtain the RetroRanker score for each reactants-product pair. RetroRanker is trained using the label-smoothed (0.01) cross-entropy loss function. We use the Adam [2] optimizer with an initial learning rate of 3e-4 and a weight decay of 1e-5. The batch size is 512. To prevent overfitting, we use the validation data for early stopping.
On USPTO-50k, we trained the model with an Nvidia A100 80G GPU, which takes approximately 12 hours, while training on USPTO-full takes 40 hours.
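The masked readout channels described above can be illustrated with a minimal sketch. It is not the authors' implementation: it replaces the AttentiveFP readout with a plain sum, omits the bond-information channel, and works on toy node features. The function names, atom indices, and feature values are all hypothetical.

```python
from typing import List, Sequence, Set

Vector = List[float]


def masked_sum_readout(node_feats: Sequence[Vector], keep: Set[int]) -> Vector:
    """Sum node vectors; nodes whose index is not in `keep` contribute zero."""
    dim = len(node_feats[0])
    pooled = [0.0] * dim
    for idx, feat in enumerate(node_feats):
        if idx in keep:  # masked atom: representation is preserved
            for j, value in enumerate(feat):
                pooled[j] += value
    return pooled


def reaction_representation(
    reactant_feats: Sequence[Vector],
    product_feats: Sequence[Vector],
    reacted_r: Set[int],
    leaving_r: Set[int],
    reacted_p: Set[int],
) -> Vector:
    """Concatenate five node-level readout channels (bond channel omitted)."""
    all_r = set(range(len(reactant_feats)))
    all_p = set(range(len(product_feats)))
    channels = [
        masked_sum_readout(reactant_feats, all_r),      # full reactant graph
        masked_sum_readout(product_feats, all_p),       # full product graph
        masked_sum_readout(reactant_feats, reacted_r),  # reacted atoms in reactants
        masked_sum_readout(reactant_feats, leaving_r),  # leaving-group atoms
        masked_sum_readout(product_feats, reacted_p),   # reacted atoms in product
    ]
    return [value for channel in channels for value in channel]


# Toy example: 3 reactant atoms and 2 product atoms with 2-dimensional features.
reactant_feats = [[1.0, 0.0], [0.5, 2.0], [0.0, 3.0]]
product_feats = [[1.0, 1.0], [2.0, 0.5]]
rep = reaction_representation(
    reactant_feats, product_feats, reacted_r={1}, leaving_r={2}, reacted_p={1}
)
```

In this sketch the concatenated vector has five times the node feature size; in the setting described above, each channel would instead come from the trained encoders, and the result would be fed to the two-layer scoring network.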

RetroRanker with Graphormer
Similar to RetroRanker with AttentiveFP, we use two independent Graphormer encoders to encode the reactant and product molecular graphs. The settings of the Graphormer encoder follow the "graphormer base" architecture. The reaction representation is obtained by concatenating the representations of the "CLS" token from the reactant and the product, where "CLS" is a special token in Transformer-based models. On USPTO-full, we train the model for 500,000 steps on 8 Nvidia V100 32G GPUs and report the results using the averaged model parameters of the last 10 checkpoints. On USPTO-50k, the model is trained for 50,000 steps, and we report the results using the averaged model parameters of the last 5 checkpoints.
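Averaging the parameters of the last few checkpoints can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code used here: checkpoints are represented as plain dictionaries mapping a parameter name to a flat list of floats, and the function name `average_checkpoints` is hypothetical.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Params], last_k: int) -> Params:
    """Element-wise mean of every parameter over the last `last_k` checkpoints."""
    if last_k <= 0 or last_k > len(checkpoints):
        raise ValueError("last_k must be between 1 and the number of checkpoints")
    recent = checkpoints[-last_k:]
    averaged: Params = {}
    for name in recent[0]:
        # zip(*...) groups the i-th value of this parameter across checkpoints.
        columns = zip(*(ckpt[name] for ckpt in recent))
        averaged[name] = [sum(col) / len(recent) for col in columns]
    return averaged


# Toy example: three checkpoints, averaging only the last two.
checkpoints = [
    {"w": [0.0, 0.0], "b": [9.0]},
    {"w": [1.0, 2.0], "b": [1.0]},
    {"w": [3.0, 4.0], "b": [3.0]},
]
avg = average_checkpoints(checkpoints, last_k=2)
```

With real models the same operation would be applied to each tensor in the saved state, using `last_k=10` on USPTO-full and `last_k=5` on USPTO-50k as stated above.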
As shown in Table 4, RetroRanker achieves the best performance when re-ranking predictions with strategy S2, i.e., S2(100%, 2) on USPTO-50k. We find that the overall result of using re-ranking strategy S2 is comparable under different re-ranking ratios, provided that the re-ranking ratio p is above a certain threshold (e.g., 75%).
We mainly show the effect of the re-ranking parameters for the strategy S1 using RetroRanker over Augmented Transformer. Supplementary Figure 1 shows the accuracies when varying the re-ranking ratio (Supplementary Figure 1(a)) and the number of preserved top-ranked predictions (Supplementary Figure 1(b)). In Supplementary Figure 1(a), the performance improves as the re-ranking ratio p increases, and the overall results are comparable once p is above 0.75. This also explains why the results of S2 are comparable when the re-ranking ratio is above the threshold of 0.75. Supplementary Figure 1(b) shows that RetroRanker achieves improved performance when varying the number of preserved top-ranked predictions. One interesting observation is that, if the top-1 prediction is preserved, the top-2 accuracy is the best among all settings; similarly, if the top-3 predictions are preserved, the top-4 accuracy is the best. This indicates that our re-ranking strategy is flexible: we can tune the number of preserved top-ranked predictions for specific performance requirements. For example, to obtain the best overall accuracy at top-20, we can set the number of preserved predictions to 18 or 19.
We show an illustrative example in Supplementary Table 2 to describe the re-ranking process. Note that only predictions whose RetroRanker scores are among the bottom 50% are re-ranked in S1(50%, 3), while the ranking of the other predictions is preserved.
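The two strategies can be sketched in a few lines. This is a hedged reading of the description above and of the notes to Supplementary Table 2, not a reference implementation. It assumes that a higher RetroRanker score means a more plausible prediction, that the bottom fraction p is taken over the non-preserved predictions and rounded down, that S2 orders predictions by the sum of their original rank and their S1 rank, and that ties in S2 are broken by the original rank. The function names `rerank_s1` and `rerank_s2` are hypothetical.

```python
from typing import List, Sequence


def rerank_s1(scores: Sequence[float], ratio: float, keep: int) -> List[int]:
    """S1(ratio, keep): return original indices (0-based) in their new order.

    scores[i] is the ranker score of the prediction at original rank i + 1.
    """
    n = len(scores)
    head = list(range(min(keep, n)))  # preserved top-ranked predictions
    tail = list(range(len(head), n))  # candidates for re-ranking
    n_move = int(ratio * len(tail))   # assumption: round down
    lowest_first = sorted(tail, key=lambda i: scores[i])
    moved = set(lowest_first[:n_move])  # bottom-scoring fraction
    stay = [i for i in tail if i not in moved]  # original order is kept
    # Moved predictions go to the end, ordered by descending score.
    moved_sorted = sorted(moved, key=lambda i: (-scores[i], i))
    return head + stay + moved_sorted


def rerank_s2(scores: Sequence[float], ratio: float, keep: int) -> List[int]:
    """S2(ratio, keep): order by original rank plus S1 rank (assumed tie-break)."""
    s1_order = rerank_s1(scores, ratio, keep)
    s1_rank = {idx: pos for pos, idx in enumerate(s1_order)}
    return sorted(range(len(scores)), key=lambda i: (i + s1_rank[i], i))


# Toy example with six predictions listed in their original order.
scores = [0.1, -1.0, 1.0, -2.0, 3.0, 0.5]
s1_full = rerank_s1(scores, ratio=1.0, keep=2)  # [0, 1, 4, 2, 5, 3]
s1_half = rerank_s1(scores, ratio=0.5, keep=2)  # [0, 1, 2, 4, 5, 3]
s2_full = rerank_s2(scores, ratio=1.0, keep=2)  # [0, 1, 2, 4, 3, 5]
```

In the toy example, S1 with a ratio of 100% lets the highest-scoring prediction jump directly behind the preserved ones, whereas S2 moves it up by only one position because its original rank still counts. This illustrates why S2 is the more conservative of the two strategies under these assumptions.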

Results on various GNN backbones and strategies
Supplementary Tables 3 and 4 show the full results of re-ranking on USPTO-50k and USPTO-full, respectively. With re-ranking strategy S2(100%, 2), on USPTO-50k, RetroRanker with Graphormer improves all the top-5 to top-9 accuracies by nearly 1%, and RetroRanker with AttentiveFP also improves the top-7 to top-10 accuracies by about 1%. On USPTO-full, the best improvement on Augmented Transformer is achieved using strategy S2(100%, 0) with Graphormer, which improves the top-3 to top-5 accuracy by more than 2%. Note that the results we report here are all trained based on predictions of R-SMILES. We tune the parameters (re-ranking ratio and the number of preserved predictions) on R-SMILES under S1 strategy, and find that it is difficult to achieve steady improvement. When using the strategy S2(100%, 0), we achieve improvement over all positions, and RetroRanker with Graphormer performs better than that with AttentiveFP. For example, the top-4 to top-7 accuracies are improved by nearly 1%.

Supplementary Table 2 (illustrative re-ranking example; only the last two rows are recoverable)
Original rank    RetroRanker score    S1(50%, 3) *    S1(100%, 3) †    S2(100%, 3) ‡
9                +0.0                 7               5                8 (9+5)
10               -4.0                 10              10               10 (10+10)
* The top-3 predictions are preserved in S1(50%, 3). Predictions whose RetroRanker scores are among the bottom 50% are highlighted in red. They are moved to the end of the list and will be re-ranked based on RetroRanker scores. The order of other predictions (highlighted in green) is preserved.
† The top-3 predictions are preserved in S1(100%, 3). All other predictions are re-ranked based on RetroRanker scores (highlighted in red).
‡ Numbers in parentheses denote the sum of the original rank and new rank in S1(100%, 3).

Notes to Supplementary Table 3 (table body not recoverable)
* The re-ranking strategy is S1(50%, 2). "AF" is for AttentiveFP, and "GH" is for Graphormer; the abbreviations are also used in the following tables.
† The re-ranking strategy is S2(100%, 2).

Notes to Supplementary Table 4 (table body not recoverable)
* The re-ranking strategy is S1(100%, 2).
† The re-ranking strategy is S2(100%, 0).
‡ The re-ranking strategy is S1(50%, 2).
∆ The re-ranking strategy is S2(100%, 2).
On proposals predicted by RetroXpert, we performed additional experiments to verify the effectiveness of our model and features. With backbones such as WLN [3] or Weave [4], the re-ranking performance is comparable to that with AttentiveFP. However, in the ablation study, the performance dropped significantly when the reaction change features were removed, as shown in Supplementary Table 5. The improvement over rxn-ebm can be primarily attributed to the introduction of both molecular features and reaction change features, which are crucial for learning the representations of chemical reactions.
Note to Supplementary Table 5: * The re-ranking strategy is S1(90%, 0).

Re-ranking Augmented Transformer predictions with RetroRanker trained over R-SMILES
We find that for Augmented Transformer, the improvement is more significant when re-ranking with RetroRanker trained using the R-SMILES data than with RetroRanker trained on its own predictions. Supplementary Table 6 shows the results of re-ranking with RetroRanker trained under various settings. The results show that RetroRanker trained on R-SMILES can potentially be considered as a plug-and-play re-ranking plugin, or a pretrained ranking model that can be finetuned, to achieve improved performance.

Analysis on ranking after RetroRanker for Augmented Transformer's predictions
For predictions of Augmented Transformer on USPTO-full, after re-ranking with S2(100%, 0), the rankings of the recorded reactants improve for 13,088 product molecules, while they decline for 5,527 product molecules. The average increase in ranking is 2.1 positions, while the average decline is 1.7 positions. Improvements are therefore both more frequent and larger on average than declines, which indicates that the overall improvement in ranking outweighs the observed decline.
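The bookkeeping behind such an analysis can be sketched as follows. The snippet assumes that, for each product, the rank of the recorded reactants before and after re-ranking is available as an integer (1 is best). The function name `summarize_rank_changes` and the toy ranks are hypothetical and do not come from the experiments.

```python
from typing import Dict, Sequence


def summarize_rank_changes(before: Sequence[int], after: Sequence[int]) -> Dict[str, float]:
    """Count improved and declined cases and average their rank shifts."""
    improved = [b - a for b, a in zip(before, after) if a < b]
    declined = [a - b for b, a in zip(before, after) if a > b]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "n_improved": len(improved),
        "avg_improvement": mean(improved),
        "n_declined": len(declined),
        "avg_decline": mean(declined),
        # Total positions gained minus total positions lost.
        "net_positions": sum(improved) - sum(declined),
    }


# Toy example: ranks of the recorded reactants for five products.
before = [5, 3, 1, 8, 2]
after = [2, 4, 1, 6, 5]
summary = summarize_rank_changes(before, after)
```

Plugging the rounded figures reported above into the same bookkeeping gives a net gain of roughly 13,088 × 2.1 − 5,527 × 1.7 ≈ 18,089 rank positions.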